Abstract

We consider the correlated multiarmed bandit (MAB) problem in which the rewards associated with each arm are modeled by a 
multivariate Gaussian random variable, and we investigate the influence of the assumptions in the Bayesian prior on the performance 
of the upper credible limit (UCL) algorithm and a new correlated UCL algorithm. We rigorously characterize the influence of 
accuracy, confidence, and correlation scale in the prior on the decision-making performance of the algorithms. Our results show 
how priors and correlation structure can be leveraged to improve performance. 
1. Introduction

MAB problems [1] are a class of resource allocation problems in which a decision-maker allocates
a single resource by sequentially choosing one among a set of competing alternative options
called arms. In the so-called stationary MAB problem, a decision-maker at each discrete time
instant chooses an arm and collects a reward drawn from an unknown stationary probability
distribution associated with the selected arm. The objective of the decision-maker is to
maximize the total expected reward aggregated over the sequential allocation process. These
problems capture the fundamental trade-off between exploration (collecting more information to
reduce uncertainty) and exploitation (using the current information to maximize the immediate
reward), and they model a variety of robotic missions including search and surveillance.

Recently, there has been significant interest in Bayesian algorithms for the MAB problem
[2, 3, 4, 5]. Bayesian methods are attractive because they allow for incorporating prior
knowledge and spatial structure of the problem through the prior in the inference process.

In this paper, we investigate the influence of the prior on 
the performance of a Bayesian algorithm for the MAB problem 
with Gaussian rewards. 

MAB problems became popular following the seminal paper by Robbins [6] and gathered interest in
diverse areas including controls [7], robotics [8, 9, 10, 11], machine learning [12, 13],
economics [14], ecology [15, 16], and neuroscience [17, 18]. Much recent work on MAB problems
focuses on a quantity termed cumulative expected regret. The cumulative expected regret of a
sequence of decisions is the cumulative difference between the expected reward of the options
chosen and the maximum possible expected reward. In a ground-breaking work, Lai and Robbins [19]
established a logarithmic lower bound on the expected number of times a suboptimal arm needs to
be sampled by an optimal policy in a frequentist setting, thereby showing that cumulative
expected regret is bounded below by a logarithmic function of time. Their work established the
best possible performance of any solution to the standard MAB problem. They also developed an
algorithm based on an upper confidence bound on estimated reward and showed that this algorithm
achieves the performance bound asymptotically.

In the following, we use the phrase logarithmic regret to refer to cumulative expected regret
being bounded above by a logarithmic function of time, i.e., having the same order of growth
rate as the optimal solution.

In the context of the bounded MAB problem, i.e., the MAB problem in which the reward is sampled
from a distribution with a bounded support, Auer et al. [20] developed upper confidence
bound-based algorithms that achieve logarithmic regret uniformly in time; see [21] for an
extensive survey of upper confidence bound-based algorithms.

Bayesian approaches to the MAB problem have also been considered. Srinivas et al. [3] developed
asymptotically optimal upper confidence bound-based algorithms for Gaussian process
optimization. Agrawal and Goyal [4, 22] showed that a Bayesian algorithm known as Thompson
sampling is near-optimal for binary bandits with a uniform prior. Liu and Li [24] characterize
the sensitivity of the performance of Thompson sampling to the assumptions on prior. Kaufmann
et al. [2] developed a generic Bayesian upper confidence bound-based algorithm and established
its optimality for binary bandits with a uniform prior.

Reverdy et al. [5] studied the Bayesian algorithm proposed in [2] in the case of correlated
Gaussian rewards and analyzed its performance for uninformative priors. They called this
algorithm the upper credible limit (UCL) algorithm and showed that the UCL algorithm models
human decision-making in the spatially-embedded MAB problem. We define a spatially-embedded MAB
problem as an MAB problem in which the arms are embedded in a metric space and the correlation
coefficient between arms is a function of distance between them. For example, in the problem of
spatial search over an uncertain distributed resource field, patches in the environment can be
modeled as spatially located alternatives and the spatial structure of the resource
distribution as a prior on the spatially correlated reward. This is an example of a
spatially-embedded MAB problem. It was observed in [5] that good assumptions on the correlation
structure result in significant improvement of the performance of the UCL algorithm, and these
assumptions can successfully account for the better performance of human subjects.

Preprint submitted to Elsevier
July 9, 2015

In this note we rigorously study the influence of the assumptions in the prior on the
performance of the UCL algorithm for a MAB problem with Gaussian rewards. Since the UCL
algorithm models human decision-making well, the results in this paper help us identify the set
of parameters in the prior that explain the individual differences in performance of human
subjects. The major contributions of this work are twofold:

First, we study the UCL algorithm with uncorrelated informative prior and characterize its
performance. We illuminate the opposing influences of the degree of confidence of a prior and
the magnitude of its inaccuracy, i.e., the gap between its mean prediction and the true mean
reward value, on the decision-making performance.

Second, we propose and study a new correlated UCL algorithm with correlated informative prior
and characterize its performance. We show that large correlation scales reduce the number of
steps required to explore the surface. We then show that incorrectly assumed large correlation
scales may lead to a much higher number of selections of suboptimal arms than suggested by the
Lai-Robbins bound. This analysis provides insight into the structure of good priors in the
context of explore-exploit problems.

The remainder of the paper is organized in the following way. In Section 2 we recall the MAB
problem and an associated Bayesian algorithm, UCL. We analyze the UCL algorithm for
uncorrelated informative prior and correlated informative prior in Sections 3 and 4,
respectively. We illustrate our results with some numerical examples in Section 5 and we
conclude in Section 6.

2. MAB Problem and Bayes-UCB Algorithm 

In this section we recall the MAB problem and the Bayes-UCB algorithm proposed in [2].

2.1. The MAB problem 

The $N$-armed bandit problem refers to the choice among $N$ options that a decision-making
agent should make to maximize the cumulative expected reward. The agent collects reward
$r_t \in \mathbb{R}$ by choosing arm $i_t$ at each time $t \in \{1,\dots,T\}$, where
$T \in \mathbb{N}$ is the horizon length for the sequential decision process. In the so-called
stationary MAB problem, the reward from option $i \in \{1,\dots,N\}$ is sampled from a
stationary distribution $p_i$ and has an unknown mean $m_i \in \mathbb{R}$. The decision-maker's
objective is to maximize the cumulative expected reward $\sum_{t=1}^{T} m_{i_t}$ by selecting a
sequence of arms $\{i_t\}_{t \in \{1,\dots,T\}}$. Equivalently, defining
$m_{i^*} = \max\{m_i \mid i \in \{1,\dots,N\}\}$ and $R_t = m_{i^*} - m_{i_t}$ as the expected
regret at time $t$, the objective can be formulated as minimizing the cumulative expected
regret defined by

$$\sum_{t=1}^{T} R_t = T m_{i^*} - \sum_{i=1}^{N} m_i \mathbb{E}[n_i(T)] = \sum_{i=1}^{N} \Delta_i \mathbb{E}[n_i(T)],$$

where $n_i(T)$ is the total number of times option $i$ has been chosen until time $T$ and
$\Delta_i = m_{i^*} - m_i$ is the expected regret due to picking arm $i$ instead of arm $i^*$.
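As a small illustration of the regret decomposition above, the following sketch evaluates the cumulative expected regret both as $\sum_i \Delta_i \mathbb{E}[n_i(T)]$ and as $T m_{i^*} - \sum_i m_i \mathbb{E}[n_i(T)]$. The function name and the three-armed numbers are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def cumulative_expected_regret(means, expected_counts):
    """Return sum_i Delta_i * E[n_i(T)], with Delta_i = max(means) - means[i]."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, expected_counts))


# Hypothetical 3-armed example (illustrative values, not from the paper).
means = [1.0, 0.5, 0.0]         # mean rewards m_i
expected_counts = [80, 15, 5]   # expected selection counts E[n_i(T)], T = 100
regret = cumulative_expected_regret(means, expected_counts)

# The same quantity written as T * m_{i*} - sum_i m_i * E[n_i(T)].
T = sum(expected_counts)
regret_alt = T * max(means) - sum(m * n for m, n in zip(means, expected_counts))
print(regret, regret_alt)
```

Both expressions agree because the expected selection counts sum to the horizon $T$.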

2.2. The Bayes-UCB algorithm 

The Bayes-UCB algorithm for the stationary $N$-armed bandit problem was proposed in [2]. The
Bayes-UCB algorithm at each time

(i). computes the posterior distribution of the mean reward at each arm;

(ii). computes a $(1 - \alpha(t))$ upper credible limit for each arm;

(iii). selects the arm with highest upper credible limit.

In step (ii), the upper credible limit is defined as the least upper bound to the upper
credible set, and the function $\alpha : \mathbb{N} \to (0,1)$ is tuned to achieve efficient
performance. In the context of Bernoulli rewards, Kaufmann et al. [2] set
$\alpha(t) = 1/(t (\log T)^c)$, for some $c \in \mathbb{R}_{>0}$, and show that for $c \geq 5$
and uninformative priors, the Bayes-UCB algorithm achieves the optimal performance.

Reverdy et al. [5, 18] studied the Bayes-UCB algorithm in the context of Gaussian rewards with
known variances. For simplicity the algorithm in [5, 18] is called the UCL (upper credible
limit) algorithm. It is shown that for an uninformative prior, the UCL algorithm is
order-optimal, i.e., it achieves cumulative expected regret that is within a constant factor of
that suggested by the Lai-Robbins bound. It is also shown that a variation of the UCL algorithm
models human decision-making in an MAB task.

3. Uncorrelated Gaussian MAB Problem 

In this paper, we focus on the Gaussian MAB problem, i.e., the reward distribution $p_i$ is
Gaussian with mean $m_i$ and variance $\sigma_s^2$. The variance $\sigma_s^2$ is assumed known,
e.g., from previous observations or known characteristics of the reward generation process. We
now recall the UCL algorithm and analyze its performance for a general prior.

3.1. The UCL algorithm 

Suppose the prior on the mean rewards at each arm is a Gaussian random variable with mean
$\mu_i^0 \in \mathbb{R}$ and variance $\sigma_0^2 \in \mathbb{R}_{>0}$, $i \in \{1,\dots,N\}$.

For the above MAB problem, let the number of times arm $i$ has been selected until time $t$ be
denoted by $n_i(t)$. Let the empirical mean of the rewards from arm $i$ until time $t$ be
$\bar{m}_i(t)$. Then, the posterior distribution at time $t$ of the mean reward at arm $i$ has
mean and variance

$$\mu_i(t) = \frac{\delta^2 \mu_i^0 + n_i(t)\bar{m}_i(t)}{\delta^2 + n_i(t)}, \qquad \sigma_i^2(t) = \frac{\sigma_s^2}{\delta^2 + n_i(t)},$$

respectively, where $\delta^2 = \sigma_s^2/\sigma_0^2$. Moreover,

$$\mathbb{E}[\mu_i(t)] = \frac{\delta^2 \mu_i^0 + n_i(t) m_i}{\delta^2 + n_i(t)} \quad \text{and} \quad \mathrm{Var}[\mu_i(t)] = \frac{n_i(t)\sigma_s^2}{(\delta^2 + n_i(t))^2}.$$

The UCL algorithm for the Gaussian MAB problem, at each decision instance
$t \in \{1,\dots,T\}$, selects an arm with the maximum $(1 - 1/(Kt^a))$-upper credible limit,
i.e., it selects an arm $i_t = \mathrm{argmax}\{Q_i(t) \mid i \in \{1,\dots,N\}\}$, where

$$Q_i(t) = \mu_i(t) + \sigma_i(t)\Phi^{-1}(1 - \alpha_t),$$

$\Phi^{-1} : (0,1) \to \mathbb{R}$ is the inverse cumulative distribution function for the
standard Gaussian random variable, $\alpha_t = 1/(Kt^a)$, and $K \in \mathbb{R}_{>0}$ and
$a \in \mathbb{R}_{>0}$ are tunable parameters.

In the context of Gaussian rewards, the function $Q_i(t)$ decomposes into two terms
corresponding to the estimate of the mean reward and the associated variance. This makes the
UCL algorithm amenable to an analysis akin to the analysis for UCB1 [20]. Using such an
analysis, it was shown in [5] that the UCL algorithm with an uninformative prior and parameter
values $K = \sqrt{2\pi e}$ and $a = 1$ achieves an order-optimal performance. In the following,
we investigate the performance of the UCL algorithm for general priors.
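The posterior update and the index $Q_i(t)$ described in this subsection can be sketched for a single arm as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ucl_index`, its argument names, and the example numbers are hypothetical, and the defaults $K = \sqrt{2\pi e}$, $a = 1$ are simply the parameter values quoted above.

```python
import math
from statistics import NormalDist


def ucl_index(mu0, sigma0_sq, sigma_s_sq, n, mbar, t,
              K=math.sqrt(2.0 * math.pi * math.e), a=1.0):
    """Upper credible limit Q_i(t) for one arm with a Gaussian prior.

    mu0, sigma0_sq : prior mean and prior variance of the arm
    sigma_s_sq     : known sampling variance
    n, mbar        : number of samples of the arm and their empirical mean
    t              : current decision instance (t >= 1)
    """
    delta_sq = sigma_s_sq / sigma0_sq
    mu = (delta_sq * mu0 + n * mbar) / (delta_sq + n)   # posterior mean
    sigma = math.sqrt(sigma_s_sq / (delta_sq + n))      # posterior std. dev.
    alpha = 1.0 / (K * t ** a)
    return mu + sigma * NormalDist().inv_cdf(1.0 - alpha)


# Two hypothetical arms with identical posterior means but different sample counts.
q_few = ucl_index(mu0=30.0, sigma0_sq=10.0, sigma_s_sq=10.0, n=1, mbar=30.0, t=5)
q_many = ucl_index(mu0=30.0, sigma0_sq=10.0, sigma_s_sq=10.0, n=10, mbar=30.0, t=5)
print(q_few, q_many)
```

With equal posterior means, the less-sampled arm receives the larger index, which is the exploration bonus that the second term of $Q_i(t)$ provides.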


3.2. Regret Analysis for uncorrelated prior 

To analyze the regret of the UCL algorithm, we require some 
inequalities that we recall in the following lemma. 

Lemma 1 (Relevant inequalities). For the standard normal random variable $z$ and the
associated inverse cumulative distribution function $\Phi^{-1}$, the following statements hold:

(i). for any $w \in [0, +\infty)$,

$$P(z \geq w) \leq \frac{2 e^{-w^2/2}}{\sqrt{2\pi}\,(w + \sqrt{w^2 + 8/\pi})} \leq \frac{1}{2} e^{-w^2/2},$$

$$P(z \geq w) \geq \sqrt{\frac{2}{\pi}}\, \frac{e^{-w^2/2}}{w + \sqrt{w^2 + 4}};$$

(ii). for any $\alpha \in [0, 0.5]$, $t \in \mathbb{N}$ and $a > 1$,

$$\Phi^{-1}(1 - \alpha) \leq \sqrt{-2\log \alpha},$$

$$\Phi^{-1}(1 - \alpha) > \sqrt{-\log\big(2\pi\alpha^2(1 - \log(2\pi\alpha^2))\big)},$$

$$\Phi^{-1}\Big(1 - \frac{1}{\sqrt{2\pi e}\, t^a}\Big) > \sqrt{\frac{3a}{2}\log t}.$$

Statement (i) in Lemma 1 can be found in [25]. The first inequality in (ii) follows from (i).
The second inequality in (ii) was established in [5], and the last inequality can be easily
verified using the second inequality in (ii).

Lemma 2 (Difference of squares inequality). For any $c_1, c_2 \in \mathbb{R}$ such that
$(1 - c_1)(1 + c_2) \geq 1$,

$$(x - y)^2 \geq c_1 x^2 - c_2 y^2, \quad \text{for any } x, y \in \mathbb{R}.$$

Proof. The inequality follows trivially using a completing the square argument. □

Let $\Delta m_i = m_i - \mu_i^0$, for each $i \in \{1,\dots,N\}$. Set $a > \cdots$,
$c_1 = \cdots$, $c_2 = \cdots$, for some $\epsilon \in (0,1)$.

Theorem 3 (Regret for uncorrelated prior). For the Gaussian MAB problem, and the UCL algorithm
with uncorrelated prior, the expected number of times a suboptimal arm $i$ is selected
satisfies

$$\mathbb{E}[n_i(T)] \leq \eta_i + h_i(T),$$

where $\eta_i = \max\{1, \lceil \cdots (2\log K + 2a\log T) - \delta^2 \rceil\}$, and $h_i(T)$
is defined in (1).

Proof. See Appendix A. □

Remark 4 (Regret of uncorrelated UCL algorithm). The expression for $h_i(T)$ in (1) suggests
that if the prior underestimates a suboptimal arm and overestimates the optimal arm, then
$h_i(T)$ is a small constant (the last case in (1)). Further, if $\sigma_0^2$ is small, i.e.,
the prior is confident in these estimates, then a large constant $\delta^2$ is subtracted from
the logarithmic term in $\eta_i$, defined in Theorem 3. This leads to a substantially smaller
expected number of suboptimal selections $\mathbb{E}[n_i(T)]$ for an informative prior compared
to an uninformative prior over a short time horizon.

If the prior underestimates the optimal arm, which corresponds to the first two cases in (1),
then $h_i(T)$ is a large constant that depends exponentially on
$\Delta m_{i^*}^2/\sigma_0^2$. A similar effect is observed if a suboptimal arm is
overestimated, which corresponds to the first and third case in (1). Further, if $\sigma_0^2$
is small, then the reduction in expected number of suboptimal selections due to large
$\delta^2$ in $\eta_i$ may be overpowered by the large constant in $h_i(T)$. Here, there exists
a range of $\sigma_0$ for which an informative prior leads to a smaller expected number of
suboptimal selections $\mathbb{E}[n_i(T)]$ over a short time horizon compared to an
uninformative prior.

In the asymptotic limit $T \to +\infty$, the logarithmic term in $\eta_i$ dominates and both
informative and uninformative priors will lead to a similar performance. □
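The Gaussian tail bounds in Lemma 1 (i), and the first and last quantile bounds in Lemma 1 (ii), can be spot-checked numerically. The sketch below is only a sanity check on a few grid points, not a proof; the helper names, the grid of $w$ values, the chosen values of $t$ and $a$, and the tolerance `TOL` are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
TOL = 1e-12  # slack for floating-point rounding (the bounds are tight at w = 0)


def tail(w):
    """P(z >= w) for a standard normal z."""
    return 0.5 * math.erfc(w / math.sqrt(2.0))


def tail_upper(w):
    return 2.0 * math.exp(-w * w / 2.0) / (
        math.sqrt(2.0 * math.pi) * (w + math.sqrt(w * w + 8.0 / math.pi)))


def tail_lower(w):
    return math.sqrt(2.0 / math.pi) * math.exp(-w * w / 2.0) / (
        w + math.sqrt(w * w + 4.0))


def check_tail_bounds(ws):
    """Check lower <= P(z >= w) <= upper <= exp(-w^2/2)/2 on the given grid."""
    return all(
        tail_lower(w) <= tail(w) + TOL
        and tail(w) <= tail_upper(w) + TOL
        and tail_upper(w) <= 0.5 * math.exp(-w * w / 2.0) + TOL
        for w in ws
    )


def check_quantile_bounds(ts, a):
    """Check the first and last inequalities of (ii) with alpha = 1/(sqrt(2 pi e) t^a)."""
    K = math.sqrt(2.0 * math.pi * math.e)
    for t in ts:
        alpha = 1.0 / (K * t ** a)
        q = STD_NORMAL.inv_cdf(1.0 - alpha)
        if q > math.sqrt(-2.0 * math.log(alpha)) + TOL:
            return False
        if q <= math.sqrt(1.5 * a * math.log(t)):
            return False
    return True


tail_ok = check_tail_bounds([0.25 * k for k in range(21)])     # w in [0, 5]
quantile_ok = check_quantile_bounds([1, 2, 10, 100], a=1.5)
print(tail_ok, quantile_ok)
```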


4. Correlated Gaussian MAB problem 

In this section, we study a new correlated UCL algorithm for the correlated MAB problem. We
first propose a modified UCL algorithm, and then analyze its performance. The modification is
designed to leverage prior information on correlation structure.
if $\Delta m_{i^*} > 0$, $\Delta m_i < 0$,
if $\Delta m_{i^*} > 0$, $\Delta m_i > 0$,
if $\Delta m_{i^*} < 0$, $\Delta m_i < 0$,
if $\Delta m_{i^*} < 0$, $\Delta m_i > 0$.

(1)


4.1. The correlated UCL algorithm 

Suppose the prior on the mean rewards at each arm is a multivariate Gaussian random variable
with mean vector $\mu_0 \in \mathbb{R}^N$ and covariance matrix
$\Sigma_0 \in \mathbb{R}^{N \times N}$.

For the above MAB problem, the posterior distribution of the mean rewards at each arm at time
$t$ is a Gaussian distribution with mean $\mu(t)$ and covariance $\Sigma(t)$ defined by

$$q(t) = \frac{r_t \phi(t)}{\sigma_s^2} + \Lambda(t-1)\mu(t-1)$$

$$\Lambda(t) = \frac{\phi(t)\phi(t)^T}{\sigma_s^2} + \Lambda(t-1), \quad \Sigma(t) = \Lambda(t)^{-1} \qquad (2)$$

$$\mu(t) = \Sigma(t) q(t),$$

where $\phi(t)$ is the column $N$-vector with $i_t$-th entry equal to one, and every other
entry zero. In the following, we denote entries of $\mu(t)$ and the diagonal entries of
$\Sigma(t)$ by $\mu_i(t)$ and $\sigma_i^2(t)$, $i \in \{1,\dots,N\}$, respectively.

As in Section 3.1, let $n_i(t)$ be the number of times arm $i$ has been selected until time
$t$, and $\bar{m}_i(t)$ be the empirical mean of the rewards from arm $i$ until time $t$. Then,
it is easy to verify that

$$\mu(t) = (\Lambda_0 + P(t)^{-1})^{-1}(P(t)^{-1}\bar{m}(t) + \Lambda_0 \mu_0)$$

$$\Lambda(t) = \Lambda_0 + P(t)^{-1},$$

where $\Lambda_0 = \Sigma_0^{-1}$, $P(t)$ is the diagonal matrix with entries
$\sigma_s^2/n_i(t)$, $i \in \{1,\dots,N\}$, and $\bar{m}(t)$ is the vector of $\bar{m}_i(t)$,
$i \in \{1,\dots,N\}$.

The correlated UCL algorithm for the Gaussian MAB problem, at each decision instance
$t \in \{1,\dots,T\}$, selects an arm with the maximum upper credible limit, i.e., it selects
an arm $i_t = \mathrm{argmax}\{Q_i(t) \mid i \in \{1,\dots,N\}\}$, where

$$Q_i(t) = \mu_i(t) + \sigma_i(t)\sqrt{\sum_{j=1}^{N} \rho_{ij}^2(t)}\;\Phi^{-1}(1 - \alpha_t),$$

$\Phi^{-1} : (0,1) \to \mathbb{R}$ is the inverse cumulative distribution function for the
standard Gaussian random variable, $\alpha_t = 1/(Kt^a)$, $\rho_{ij}(t)$ is the correlation
coefficient between arm $i$ and arm $j$ at time $t$, and $K \in \mathbb{R}_{>0}$ and
$a \in \mathbb{R}_{>0}$ are tunable parameters. Note that for uncorrelated priors,
$\sum_{j=1}^{N} \rho_{ij}^2(t) = 1$ and the correlated UCL algorithm reduces to the UCL
algorithm.

In the context of uninformative priors, $Q_i(1) = +\infty$ for each $i \in \{1,\dots,N\}$, and
the UCL algorithm selects each arm once in the first $N$ steps. In a similar vein, we introduce
an initialization phase for the correlated UCL algorithm.

Initialization: In the initialization phase, an arm $i_t$, defined by

$$i_t = \mathrm{argmax}\{\sigma_i^2(t-1) \mid \sigma_i^2(t-1) > \sigma_s^2/\nu, \text{ and } i \in \{1,\dots,N\}\},$$

is selected at time $t$. Here, $\nu \leq 1$ is a pre-specified positive constant. Let
$t_{\text{init}}$ be the number of steps in the initialization phase.

Lemma 5 (Initialization Phase). For the correlated MAB problem and the inference process (2),
the initialization phase ends in at most $N$ steps and the variance following the
initialization phase satisfies $\sigma_i^2(t_{\text{init}}) \leq \sigma_s^2/\nu$, for each
$i \in \{1,\dots,N\}$.

Proof. Note that to prove the lemma, it suffices to show that no arm will be selected twice in
the initialization phase.

It follows from the Sherman-Morrison formula for the rank-1 update for the covariance in (2)
that

$$\sigma_i^2(t) = \sigma_i^2(t-1) - \frac{\sigma_{i i_t}^2(t-1)}{\sigma_s^2 + \sigma_{i_t}^2(t-1)}, \qquad (4)$$

where $\sigma_{ij}(t)$ is the $i,j$ component of $\Sigma(t)$, for each $i \in \{1,\dots,N\}$.
If $i_t = j$, then
$\sigma_j^2(t) = \sigma_s^2\sigma_j^2(t-1)/(\sigma_s^2 + \sigma_j^2(t-1)) \leq \sigma_s^2$.
Thus, arm $j$ will not be selected again in the initialization phase, which establishes our
claim. □


Remark 6 (Correlation Structure and Initialization). Lemma 5 states that the length of the
initialization phase is upper bounded by $N$. For an uninformative prior, the above
initialization phase reduces to visiting each arm once, and the variance at each arm after the
initialization phase is $\sigma_s^2$ ($\nu = 1$). In this case, the upper bound $N$ on the
number of steps in the initialization phase is achieved. For an informative prior with
correlation structure, the initialization phase may be shorter than $N$ steps, i.e., not all
arms need to be visited. This is because a visit to one arm may reduce variance in correlated
arms even if unvisited. However, the variance at those arms not visited during the
initialization phase might still be greater than $\sigma_s^2$, i.e., the bound in Lemma 5 will
be met but it is possible that $\nu < 1$. To see how variance can be reduced in arms not
visited, note the effect of prior covariance $\sigma_{i i_t}(t-1)$ on the reduction in variance
of an arm $i \neq i_t$. In particular, it follows from (4) that

$$\sigma_i^2(t) = \frac{\sigma_s^2\sigma_i^2(t-1) + \sigma_i^2(t-1)\sigma_{i_t}^2(t-1)\big(1 - \rho_{i i_t}^2(t-1)\big)}{\sigma_s^2 + \sigma_{i_t}^2(t-1)}.$$

Thus, a high value of correlation $\rho_{i i_t}(t-1)$ leads to substantial reduction in
variance of arm $i$ even when it is not selected.

To better understand the role of correlation, consider a set of arms comprised of decoupled
clusters of highly correlated arms. Consider such a cluster of arms with cardinality $m$. The
initial covariance matrix for this cluster is $\sigma_0^2(1_m 1_m^T + \epsilon E)$, where $E$
is a symmetric perturbation matrix with zero diagonal entries, $1_m$ is the vector of length
$m$ with all entries equal to one, and $0 < \epsilon \ll 1$. It follows that one eigenvalue of
$\sigma_0^2(1_m 1_m^T + \epsilon E)$ is $\sigma_0^2 m + O(\sigma_0^2\epsilon)$ and other
eigenvalues are $O(\sigma_0^2\epsilon)$. In this setting, just one sample can significantly
reduce the eigenvalue at $\sigma_0^2 m + O(\sigma_0^2\epsilon)$. Since the largest eigenvalue
of the covariance matrix is an upper bound on the variances, just one sample will reduce the
uncertainty associated with the cluster substantially. Thus, in the initialization phase, we
need a number of observations equal to the number of clusters, which may be substantially
smaller than the number of arms.

It should also be noted that correlation plays a role only for short time horizons. Once each
arm has been sampled sufficiently, the matrix $\Lambda(t)$ in (2) is substantially diagonally
dominant and behaves like a diagonal matrix. □
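The effect of correlation on the length of the initialization phase can be illustrated with a short simulation of the variance dynamics, which do not depend on the observed rewards. The sketch below is an assumption-laden illustration rather than the authors' code: the helper names, the two clusters of two arms, the within-cluster correlation of 0.95, the prior variance of 100, and the choice $\nu = 0.5$ are all hypothetical values picked so that one visit per cluster suffices.

```python
def clustered_prior(cluster_sizes, sigma0_sq, rho):
    """Block-diagonal prior covariance: correlation rho inside a cluster, 0 across clusters."""
    n = sum(cluster_sizes)
    cov = [[0.0] * n for _ in range(n)]
    start = 0
    for size in cluster_sizes:
        for i in range(start, start + size):
            for j in range(start, start + size):
                cov[i][j] = sigma0_sq if i == j else rho * sigma0_sq
        start += size
    return cov


def rank_one_update(cov, k, sigma_s_sq):
    """Posterior covariance after one noisy sample of arm k (Sherman-Morrison form)."""
    n = len(cov)
    denom = sigma_s_sq + cov[k][k]
    return [[cov[i][j] - cov[i][k] * cov[k][j] / denom for j in range(n)]
            for i in range(n)]


def initialization_phase(cov, sigma_s_sq, nu):
    """Repeatedly sample the arm of largest variance among those above sigma_s^2 / nu."""
    threshold = sigma_s_sq / nu
    visited = []
    while True:
        candidates = [i for i in range(len(cov)) if cov[i][i] > threshold]
        if not candidates:
            return visited, cov
        k = max(candidates, key=lambda i: cov[i][i])  # ties go to the lowest index
        visited.append(k)
        cov = rank_one_update(cov, k, sigma_s_sq)


# Two decoupled clusters of two highly correlated arms (hypothetical values).
visited, post = initialization_phase(
    clustered_prior([2, 2], sigma0_sq=100.0, rho=0.95), sigma_s_sq=10.0, nu=0.5)

# The same four arms with no correlation: every arm must be visited.
visited_uncorr, _ = initialization_phase(
    clustered_prior([2, 2], sigma0_sq=100.0, rho=0.0), sigma_s_sq=10.0, nu=0.5)
print(visited, visited_uncorr)
```

In this example one sample per cluster brings every variance below $\sigma_s^2/\nu$, whereas the uncorrelated prior needs one sample per arm, matching the qualitative statement of the remark.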


4.2. Regret analysis for correlated UCL algorithm 

For correlated priors, the inference equations (2) yield the following expressions for the
bias $e$ and covariance $\Sigma'$ of the estimate $\mu(t)$:

$$e(t) := \mathbb{E}[\mu(t)] - m = (\Lambda_0 + P(t)^{-1})^{-1}\Lambda_0(\mu_0 - m)$$

$$\Sigma'(t) := \mathrm{cov}(\mu(t)) = (\Lambda_0 + P(t)^{-1})^{-1} P(t)^{-1} (\Lambda_0 + P(t)^{-1})^{-1},$$

where $m$ is the vector of mean reward.

Let $\sigma_i^2(t)$ and $\sigma_{ij}(t)$, $i,j \in \{1,\dots,N\}$, be the diagonal and
off-diagonal entries of $\Sigma(t)$, and ${\sigma_i'}^2(t)$, $i \in \{1,\dots,N\}$, be the
diagonal entries of $\Sigma'(t)$.

We now analyze the properties of covariance matrices $\Sigma(t)$ and $\Sigma'(t)$. Let
$\Sigma_{-i}(0) \in \mathbb{R}^{(N-1)\times(N-1)}$ be the submatrix of $\Sigma_0$ obtained
after excluding the $i$-th row and $i$-th column. Let
$\sigma_{-i}(0) \in \mathbb{R}^{1\times(N-1)}$ be the row vector obtained after excluding the
$i$-th entry from the $i$-th row of $\Sigma_0$. We define the variance of arm $i$ conditioned
on the mean reward at every other arm by

$$\sigma_{i\text{-cond}}^2 = \sigma_i^2(0) - \sigma_{-i}(0)\Sigma_{-i}^{-1}(0)\sigma_{-i}^T(0).$$

Let $\delta_{i\text{-cond}}^2 = \sigma_s^2/\sigma_{i\text{-cond}}^2$. With a slight abuse of
notation, we refer to $n_i(t)$ as the number of times arm $i$ is selected after the
initialization phase. We also define for each $i \in \{1,\dots,N\}$


Pi 


K?(l+Co„d) 


N N 


j=\ k^\ 


where is the k, j component of Aq. 

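Lemma 7 (Bounds on variances). The following statements hold for the inference process (2):

(i). the variance $\sigma_i^2(t)$ satisfies

$$\sigma_i^2(t) \leq \frac{\sigma_s^2}{\nu + n_i(t)}, \quad \text{and} \quad \sigma_i^2(t) \geq \frac{\sigma_s^2}{\delta_{i\text{-cond}}^2 + n_i(t)};$$

(ii). the variance ${\sigma_i'}^2(t)$ satisfies

$${\sigma_i'}^2(t) \leq \sigma_i^2(t)\sum_{j=1}^{N} \rho_{ij}^2(t), \quad \text{and}$$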

_2 

o-'it) > 


N 

> V , 


erf 


Proof. We start by establishing the first statement. The covariance update in (2) can be
simplified using the Sherman-Morrison formula to obtain

$$\Sigma(t+1) = \Sigma(t) - \frac{\Sigma(t)\phi(t+1)\phi(t+1)^T\Sigma(t)}{\sigma_s^2 + \phi(t+1)^T\Sigma(t)\phi(t+1)}. \qquad (5)$$

It follows that

$$\sigma_i^2(t+1) = \sigma_i^2(t) - \frac{\sigma_{i i_{t+1}}^2(t)}{\sigma_s^2 + \sigma_{i_{t+1}}^2(t)}.$$

It follows that after the initialization phase $\sigma_i^2(t) \leq \sigma_s^2/\nu$. Moreover,
at each future round, if $i_{t+1} \neq i$, then $\sigma_i^2(t+1) \leq \sigma_i^2(t)$;
otherwise, $\sigma_i^2(t+1) = \sigma_s^2\sigma_i^2(t)/(\sigma_s^2 + \sigma_i^2(t))$. The upper
bound on $\sigma_i^2(t)$ immediately follows from this observation and the induction argument.

We now establish the lower bound on $\sigma_i^2(t)$. Since the inference process involves a
stationary environment, the sequence in which arms are played is of no significance and the
inference only depends on the number of times an arm has been played. Consequently, the
inference is the same if arms are played in blocks. In particular, each arm
$j \in \{1,\dots,N\}$ can be played in a block of size $n_j(t)$. Further, any order in which
these blocks are played leads to the same inference.

Suppose for such a modified allocation of arms, $t_j$ is the time when the block associated
with arm $j$ begins. Suppose that arm $i$ is played the last. Then, from (5) and for the
modified allocation process, it follows that

$$\sigma_i^2(t_j + n_j(t)) = \sigma_i^2(t_j) - \frac{n_j(t)\sigma_{ij}^2(t_j)}{\sigma_s^2 + n_j(t)\sigma_j^2(t_j)} \geq \sigma_i^2(t_j) - \frac{\sigma_{ij}^2(t_j)}{\sigma_j^2(t_j)},$$

i.e., the posterior variance $\sigma_i^2(t_j + n_j(t))$ is lower bounded by the conditional
variance of arm $i$ under a noise free reward from arm $j$. It follows that, for the modified
allocation sequence, $\sigma_i^2(t - n_i(t)) \geq \sigma_{i\text{-cond}}^2$. Now, the lower
bound follows from the variance update after the last block.

To establish the second statement, we note that $\Sigma'(t) = \Sigma(t)P(t)^{-1}\Sigma(t)$. It
follows that


N 






nj(t)o-yt)pf.{t) 


J=l 

< cr 


^ nmp) 

^ „/f) + v 


7=1 

-2/ 


cr% 


N 

>1 


where the second inequality follows from the fact
$\sigma_j^2(t) \leq \sigma_s^2/(n_j(t) + \nu)$.

Similarly, 

V «/(f)o--(f) 

- 2 - - - 2 -’ 


establishing the lower bound. 


□ 
Theorem 8 (Regret of correlated UCL algorithm). For the Gaussian MAB problem, and the
correlated UCL algorithm, the expected number of times a suboptimal arm $i$ is selected after
the initialization phase satisfies

$$\mathbb{E}[n_i(T)] \leq \eta_i + h_i(T),$$

where $\eta_i = \max\{1, \lceil \cdots (2\log K + 2a\log T) - \nu \rceil\}$, and


5% 


r r-cond 2^^^ 

hi(T) = } 


2(3aci - 4) 

Proof. See|Appendix B| 


3aci ^ 

-e 2 + e 3» + 


3aci 

-e 2 , 


2(3aci - 4) 


□ 


Remark 9 (Regret of correlated UCL algorithm). Recall that the $n_i(T)$ in Theorem 8 is the
number of selections of a suboptimal arm $i$ after the initialization phase. For an
uninformative prior, $\nu = 1$ and each arm is selected once in the initialization phase.
Consequently, the expression for $\eta_i$ will reduce to the expression in Theorem 3. In the
expression for $h_i(T)$ in Theorem 8, we consider only the worst case, which corresponds to the
first case in (1). Other cases can be considered in the spirit of (1). However, the number of
cases for a correlated prior will be significantly more than four, which is the number of cases
for an uncorrelated prior.

The correlated UCL algorithm operates in two phases. The benefit of the correlation structure
is most pronounced in the initialization phase; as mentioned in Remark 6, a highly correlated
prior helps reduce the number of initialization steps. Further, if the correlated prior is a
true measure of the environment, then the upper bound on $n_i(T)$ will be small. However, the
$\beta_i$s are large if such a highly correlated prior is not a true measure of the
environment, or a high confidence is placed on the priors, i.e., the initial variances are
small and the mean rewards in the prior are far from the true mean rewards at the arms. Large
$\beta_i$s may lead to a large constant in the upper bound on $n_i(T)$. □


5. Numerical Illustrations 

In this section, we illustrate the results of the preceding two sections with data from
numerical simulations. The theoretical results pertain to priors of different quality, defined
by how rich the information is that they can capture about the rewards associated with the
bandit. Uninformative priors capture no information, while uncorrelated informative priors
capture beliefs about individual arms. Correlated (informative) priors add to uncorrelated
informative priors the ability to capture beliefs about the relationship between different
arms, which we leverage in our new correlated UCL algorithm. When an informative prior models
the environment well, we refer to it as a well-informed prior; conversely, if the prior models
the environment poorly, we refer to it as ill-informed.

As in [5], our simulations focus on the case of a spatially-embedded bandit problem, for which
[5] showed that correlated priors can lead to higher performance. The simulations show that,
among well-informed priors, those with richer information content result in higher performance.
Theorems 3 and 8 allow us to quantify the extent to which a prior is well-informed.

We consider here the spatially-embedded bandit problem studied in [5]. The reward surface is
relatively smooth with regions of both high and low rewards. This means that a correlated prior
capturing length scale information can improve performance. The mean reward value is equal to
30, and the sampling variance for each arm is $\sigma_s^2 = 10$.

Figure 1 shows simulations from cases where the informative priors are well-informed. Mean
cumulative regret computed from an ensemble of 100 simulations is shown for three priors: an
uninformative prior, an informative uncorrelated prior, and an informative correlated prior.
For all the simulations, the parameter $\epsilon$ was set equal to
$1/\sqrt{10} \approx 0.316$, and for correlated priors the parameter $\nu$ was set equal to 1.
The informative priors have an initial mean belief $\mu_0$ with a higher value (equal to 100)
in regions with high rewards, and a lower value of zero elsewhere. The uncorrelated prior sets
$\sigma_0^2 = 10 = \sigma_s^2$, meaning the prior represents the equivalent of a single prior
observation. The correlated prior sets $\sigma_i^2(0) = 10$ as in the uncorrelated case, and
uses a correlation structure representing an exponential kernel as in [5]. This kernel encodes
the information that the closer two arms are in the embedding space, the more correlated are
their rewards.
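A prior covariance with this kind of distance-dependent correlation could be built as in the sketch below. The text only says that the kernel is exponential as in [5], so the specific form $\sigma_0^2 \exp(-d_{ij}/\lambda)$ with Euclidean distance $d_{ij}$, the function name, the 3-by-3 grid of arm positions, and the two length scales $\lambda$ are assumptions made purely for illustration.

```python
import math


def exponential_kernel_prior(positions, sigma0_sq, length_scale):
    """Prior covariance with entries sigma0_sq * exp(-distance / length_scale)."""
    return [
        [sigma0_sq * math.exp(-math.dist(p, q) / length_scale) for q in positions]
        for p in positions
    ]


# Hypothetical 3 x 3 grid of arms; prior variance 10 at each arm as in the text.
positions = [(x, y) for x in range(3) for y in range(3)]
cov_short = exponential_kernel_prior(positions, sigma0_sq=10.0, length_scale=1.0)
cov_long = exponential_kernel_prior(positions, sigma0_sq=10.0, length_scale=4.0)

# Covariance between arm (0, 0) and its neighbours at distance 1 and 2.
print(cov_short[0][1], cov_short[0][2], cov_long[0][1])
```

A longer length scale raises the covariance between neighbouring arms, which is how a prior of this form would encode a smoother reward surface.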

The richer information provided by the informative priors results in better performance in
this case where the priors are well-informed; the informative correlated prior results in less
regret than the informative uncorrelated prior, which in turn results in less regret than the
uninformative prior. For short horizons, the informative priors result in cumulative regret
which is less than the Lai-Robbins lower bound. The UCL algorithm and the correlated UCL
algorithm can violate the lower bound because of the additional information provided by the
priors, which effectively shifts the regret curve leftwards. Asymptotically, however, the
algorithms will tend to match the Lai-Robbins regret rate for any prior.

In contrast, Figure 2 shows simulations from cases where the informative priors are variously
ill-informed. Mean cumulative regret computed from an ensemble of 100 simulations is shown for
three increasingly informative priors, as in Figure 1. The informative priors have an initial
mean belief $\mu_0$ that is uniform with each element $\mu_i^0 = 30$. As in Figure 1, the
uncorrelated prior sets $\sigma_0^2 = 10 = \sigma_s^2$, meaning the prior represents the
equivalent of a single prior observation. The correlated prior sets $\sigma_i^2(0) = 10$ and
uses a correlation structure that again represents an exponential kernel but with a longer
length scale to represent a smoother reward surface.

Although the informative priors accurately represent the overall mean value of the reward
surface, they fail to capture the spatial heterogeneity of the reward surface, in particular
the fact that it has high- and low-value patches. Therefore, both informative priors are
ill-informed about the mean rewards and the informative uncorrelated prior results in much
poorer performance than the uninformative prior for moderate task horizons. However, by adding
the correlation structure to the ill-informed uncorrelated prior, we can recover much of the
performance exhibited by the well-informed correlated prior of Figure 1. In a
spatially-embedded task like the one studied here, information about correlation structure
among arms can be as valuable as accurate information about the value of individual arms.



Figure 1: Well-informed priors. Increasing the amount of information given increases
performance. The traces show mean cumulative regret from 100 simulations for each of three
different priors that model increasingly rich information about the rewards: the uninformative
prior provides no information, the informative uncorrelated prior provides information about
rewards associated to individual arms, and the informative correlated prior adds information
about the relationship between rewards associated with different arms. When used with an
uninformative prior, the algorithm must begin by sampling each arm once in what is effectively
an initialization phase. Upon completing this phase the algorithm can sample arms more
selectively, which makes the regret grow more slowly, as can be seen in the bend in the curve
at t = 100. Because of the additional information provided by the informative priors, the
algorithms can sample arms more selectively from the initial time t = 1, which results in
better performance than the uninformative prior and allows the algorithms to outperform the
Lai-Robbins bound on regret.


6. Conclusions and Future Directions 

In this note we studied and modified the UCL algorithm for the correlated MAB problem with
Gaussian rewards. We investigated the influence of the assumptions in the prior on the
performance of the UCL algorithm and the new correlated UCL algorithm. We characterized
scenarios in which the informative priors perform better than the uninformative prior and
characterized the improvement in the performance in terms of cumulative regret. In particular,
we showed conditions in which an informative correlated prior can be leveraged to significantly
reduce cumulative regret.

There are several possible avenues of future research. First,
we assumed that the environment is stationary. An interesting
future direction is to consider non-stationary environments
in which the reward at each arm may be time-varying and the
autocorrelation scale may be known. Second, we considered
these problems for a single player. Many application scenarios
involve a group of individuals, and it is of interest to study
collaborative and competitive multiplayer versions of these
problems.





Figure 2: Ill-informed priors. Increasing the amount of information given can
decrease performance. As in Figure 1, the traces show mean cumulative regret
from 100 simulations for each of three different priors. Again the algorithms
exhibit an initialization phase behavior for the uninformative and informative
correlated priors, whose end can be seen in the bends in the regret curves near
t = 100. The ill-informed correlated prior improves performance relative to
the uninformative prior, although not quite as much as the well-informed
correlated prior does in Figure 1. In contrast, the ill-informed uncorrelated prior
significantly decreases performance relative to all other priors. By encoding a
strong incorrect belief about the rewards, this prior requires multiple samples
of suboptimal arms to learn that they are suboptimal. This appears in the regret
curve as an initialization phase that lasts until t = 4,500, at which point the
mean cumulative regret is approximately 35,000.
Appendix A. Proof of Theorem

In the spirit of [20], we bound $n_i^T$ as follows:

\[
n_i^T = \sum_{t=1}^{T} I(i_t = i)
\le \eta_i + \sum_{t=1}^{T} I\big(Q_i^t > Q_{i^*}^t,\ n_i(t-1) \ge \eta_i\big),
\]

where $\eta_i$ is some positive integer and $I(x)$ is the indicator
function, with $I(x) = 1$ if $x$ is a true statement and 0 otherwise.

At time $t$, the agent picks option $i$ over $i^*$ only if

\[
Q_{i^*}^t \le Q_i^t.
\]
This is true when at least one of the following equations holds:

\begin{align}
\mu_{i^*}(t) &\le m_{i^*} - C_{i^*}(t), \tag{A.1}\\
\mu_i(t) &\ge m_i + C_i(t), \tag{A.2}\\
m_{i^*} &< m_i + 2C_i(t), \tag{A.3}
\end{align}

where $C_i(t) = \frac{\sigma_s}{\sqrt{\delta^2 + n_i(t)}}\,\Phi^{-1}(1 - \alpha_t)$
and $\alpha_t = 1/(K t^a)$. Otherwise, if none of the equations
(A.1)-(A.3) holds,

\[
Q_{i^*}(t) = \mu_{i^*}(t) + C_{i^*}(t) > m_{i^*}
\ge m_i + 2C_i(t) > \mu_i(t) + C_i(t) = Q_i(t),
\]

and option $i^*$ is picked over option $i$ at time $t$.
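
The case analysis above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names, the value of $K$, and all other parameter values are illustrative assumptions, and only the formulas for $C_i(t)$, $\alpha_t$, and the events (A.1)-(A.3) are taken from the text. It draws arbitrary posterior means and true means and counts how often $Q_{i^*}(t) > Q_i(t)$ fails when none of the three events holds, which should never happen.

```python
import math
import random
from statistics import NormalDist


def credible_width(n, t, sigma_s, delta_sq, K, a):
    """C_i(t) = sigma_s / sqrt(delta^2 + n_i(t)) * Phi^{-1}(1 - alpha_t),
    with alpha_t = 1 / (K t^a)."""
    alpha_t = 1.0 / (K * t ** a)
    return sigma_s / math.sqrt(delta_sq + n) * NormalDist().inv_cdf(1.0 - alpha_t)


def any_event_holds(mu_star, m_star, c_star, mu_i, m_i, c_i):
    """True if at least one of the events (A.1), (A.2), (A.3) holds."""
    a1 = mu_star <= m_star - c_star   # (A.1)
    a2 = mu_i >= m_i + c_i            # (A.2)
    a3 = m_star < m_i + 2.0 * c_i     # (A.3)
    return a1 or a2 or a3


def check_ordering(trials=20000, seed=0):
    """Return (cases where no event holds, cases among them with Q_{i*} <= Q_i)."""
    rng = random.Random(seed)
    # Assumed illustrative values; none of them come from the paper's experiments.
    K, a, sigma_s, delta_sq = math.sqrt(2 * math.pi * math.e), 2.0, 1.0, 1.0
    checked = violations = 0
    for _ in range(trials):
        t = rng.randint(1, 1000)
        c_star = credible_width(rng.randint(0, 50), t, sigma_s, delta_sq, K, a)
        c_i = credible_width(rng.randint(0, 50), t, sigma_s, delta_sq, K, a)
        m_star, m_i = rng.uniform(0, 5), rng.uniform(0, 5)
        mu_star, mu_i = rng.gauss(m_star, 1.0), rng.gauss(m_i, 1.0)
        if not any_event_holds(mu_star, m_star, c_star, mu_i, m_i, c_i):
            checked += 1
            if mu_star + c_star <= mu_i + c_i:
                violations += 1
    return checked, violations
```

Calling `check_ordering()` returns a pair whose second entry counts violations of the ordering; the chain of inequalities in the text implies that it is zero.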

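As noted earlier, the posterior mean $\mu_i(t)$ is a Gaussian random
variable:

\[
\mu_i(t) \sim \mathcal{N}\!\left(
\frac{\delta^2 \mu_i^0 + n_i(t)\, m_i}{\delta^2 + n_i(t)},\
\frac{n_i(t)\, \sigma_s^2}{(\delta^2 + n_i(t))^2}
\right).
\]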
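
The mean and variance in this Gaussian law can be sanity-checked by simulation. The sketch below is illustrative only and rests on two assumptions that are not stated in this appendix: that the posterior mean is the conjugate-Gaussian update $(\delta^2 \mu_i^0 + n\,\bar m)/(\delta^2 + n)$ of the empirical mean $\bar m$, and that the number of samples $n$ is held fixed rather than chosen adaptively. Function names and parameter values are hypothetical.

```python
import random
import statistics


def posterior_mean_moments(mu0, m, n, sigma_s, delta_sq):
    """Mean and variance of mu_i(t) when n_i(t) = n, per the Gaussian law above."""
    mean = (delta_sq * mu0 + n * m) / (delta_sq + n)
    var = n * sigma_s ** 2 / (delta_sq + n) ** 2
    return mean, var


def simulate_posterior_mean(mu0, m, n, sigma_s, delta_sq, trials=20000, seed=1):
    """Empirical mean and variance of (delta^2 mu0 + n * sample_mean) / (delta^2 + n),
    where sample_mean averages n rewards drawn from N(m, sigma_s^2).
    Assumes a fixed n (no adaptive sampling)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(m, sigma_s) for _ in range(n)) / n
        draws.append((delta_sq * mu0 + n * sample_mean) / (delta_sq + n))
    return statistics.fmean(draws), statistics.pvariance(draws)
```

For example, with `mu0=0.0, m=1.0, n=5, sigma_s=1.0, delta_sq=2.0`, the formula gives mean 5/7 and variance 5/49, and the simulated moments should agree up to Monte Carlo error.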


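We will now analyze the events (A.1), (A.2), and (A.3). Let
$P_1(t)$ be the probability of the event (A.1).

Lemma 10 (Probability of event (A.1)). The following statements
hold for event (A.1):

(i). if $\Delta m_{i^*} \le 0$, then

\[
\sum_{t=1}^{T} P_1(t) \le \frac{a}{K(a-1)};
\]

(ii). if $\Delta m_{i^*} > 0$, then

\[
\sum_{t=1}^{T} P_1(t) \le
\max\Big\{ e^{\frac{2\delta^4 \Delta m_{i^*}^2}{3a\sigma_s^2}},\
e^{\frac{2\Delta m_{i^*}^2}{3a\sigma_0^2}} \Big\}
+ \frac{3ac_1}{2(3ac_1 - 4)}\, e^{\frac{c_2 \delta^4 \Delta m_{i^*}^2}{2\sigma_s^2}}.
\]

Proof. For $n_{i^*}(t) \ge 1$, event (A.1) is true if

\begin{align*}
& m_{i^*} \ge \mu_{i^*}(t) + \frac{\sigma_s}{\sqrt{\delta^2 + n_{i^*}(t)}}\,\Phi^{-1}(1-\alpha_t) \\
\iff{}& m_{i^*} - \mu_{i^*}(t) \ge \frac{\sigma_s}{\sqrt{\delta^2 + n_{i^*}(t)}}\,\Phi^{-1}(1-\alpha_t) \\
\iff{}& z \le -\sqrt{\frac{n_{i^*}(t) + \delta^2}{n_{i^*}(t)}}\,\Phi^{-1}(1-\alpha_t)
+ \frac{\delta^2 \Delta m_{i^*}}{\sigma_s \sqrt{n_{i^*}(t)}},
\end{align*}

where $z \sim \mathcal{N}(0,1)$ is a standard normal random variable.

Similarly, for $n_{i^*}(t) = 0$, event (A.1) is not true if (i) $\Delta m_{i^*} \le 0$,
or (ii) $\Delta m_{i^*} > 0$ and $\Phi^{-1}(1 - \alpha_t) > \Delta m_{i^*}/\sigma_0$.

We now establish the first statement. If $\Delta m_{i^*} \le 0$ and
$n_{i^*}(t) = 0$, then $P_1(t) = 0$. If $\Delta m_{i^*} \le 0$ and
$n_{i^*}(t) \ge 1$, then

\[
P_1(t) \le P\Big(z \ge \Phi^{-1}(1-\alpha_t) - \frac{\delta^2 \Delta m_{i^*}}{\sigma_s \sqrt{n_{i^*}(t)}}\Big)
\le P\big(z \ge \Phi^{-1}(1-\alpha_t)\big) = \alpha_t.
\]

Therefore,

\[
\sum_{t=1}^{T} P_1(t) \le \sum_{t=1}^{T} \frac{1}{K t^a}
\le \frac{1}{K} + \frac{1}{K(a-1)} = \frac{a}{K(a-1)}.
\]

To establish the second statement, we note that if $\Delta m_{i^*} > 0$
and $n_{i^*}(t) = 0$, then event (A.1) does not hold if

\[
\Phi^{-1}(1-\alpha_t) > \sqrt{\frac{3a}{2}\log t} \ge \frac{\Delta m_{i^*}}{\sigma_0},
\]

which is the case if $t \ge e^{2\Delta m_{i^*}^2/(3a\sigma_0^2)}$.

If $\Delta m_{i^*} > 0$ and $n_{i^*}(t) \ge 1$, then $P_1(t) \le P(z \ge \xi)$, where
$\xi = \sqrt{\frac{3a}{2}\log t} - \frac{\delta^2 \Delta m_{i^*}}{\sigma_s}$.
Note that $\xi \ge 0$ if $t \ge e^{2\delta^4 \Delta m_{i^*}^2/(3a\sigma_s^2)}$. Define

\[
t_1^\dagger = \max\Big\{ e^{\frac{2\delta^4 \Delta m_{i^*}^2}{3a\sigma_s^2}},\
e^{\frac{2\Delta m_{i^*}^2}{3a\sigma_0^2}} \Big\}.
\]

It follows that for $t \ge t_1^\dagger$,

\begin{align*}
P_1(t) &\le \frac{1}{2}\exp\Big(-\frac{1}{2}\Big(\sqrt{\frac{3a}{2}\log t}
- \frac{\delta^2 \Delta m_{i^*}}{\sigma_s}\Big)^2\Big) \\
&\le \frac{1}{2}\exp\Big(-\frac{1}{2}\Big(\frac{3ac_1}{2}\log t
- \frac{c_2 \delta^4 \Delta m_{i^*}^2}{\sigma_s^2}\Big)\Big) \\
&= \frac{1}{2}\, e^{\frac{c_2 \delta^4 \Delta m_{i^*}^2}{2\sigma_s^2}}\, t^{-\frac{3ac_1}{4}},
\end{align*}

where the second-last inequality follows from a lemma stated earlier, and $c_1$
and $c_2$ are as defined in Section 3.

Therefore,

\[
\sum_{t=1}^{T} P_1(t) \le t_1^\dagger
+ \sum_{t=1}^{T} \frac{1}{2}\, e^{\frac{c_2 \delta^4 \Delta m_{i^*}^2}{2\sigma_s^2}}\, t^{-\frac{3ac_1}{4}}
\le t_1^\dagger + \frac{3ac_1}{2(3ac_1 - 4)}\, e^{\frac{c_2 \delta^4 \Delta m_{i^*}^2}{2\sigma_s^2}}.
\]

□

Let $P_2(t)$ be the joint probability of the event (A.2) and the
event $n_i(t) \ge \eta_i$, for some $\eta_i \in \mathbb{N}$.

Lemma 11 (Probability of event (A.2)). The following statements
hold for event (A.2):

(i). if $\Delta m_i < 0$, then

\[
\sum_{t=1}^{T} P_2(t) \le e^{\frac{2\delta^4 \Delta m_i^2}{3a\sigma_s^2 \eta_i}}
+ \frac{3ac_1}{2(3ac_1 - 4)}\, e^{\frac{c_2 \delta^4 \Delta m_i^2}{2\sigma_s^2 \eta_i}};
\]

(ii). if $\Delta m_i \ge 0$, then

\[
\sum_{t=1}^{T} P_2(t) \le \frac{a}{K(a-1)}.
\]

Proof. The event (A.2) holds if

\begin{align*}
& m_i \le \mu_i(t) - \frac{\sigma_s}{\sqrt{\delta^2 + n_i(t)}}\,\Phi^{-1}(1-\alpha_t) \\
\iff{}& \mu_i(t) - m_i \ge \frac{\sigma_s}{\sqrt{\delta^2 + n_i(t)}}\,\Phi^{-1}(1-\alpha_t) \\
\iff{}& z \ge \sqrt{\frac{n_i(t) + \delta^2}{n_i(t)}}\,\Phi^{-1}(1-\alpha_t)
+ \frac{\delta^2 \Delta m_i}{\sigma_s \sqrt{n_i(t)}},
\end{align*}

where $z \sim \mathcal{N}(0,1)$ is a standard normal random variable.

We start with establishing the first statement. If $\Delta m_i < 0$ and
$n_i(t) \ge \eta_i$, then

\[
P_2(t) \le P\Big(z \ge \Phi^{-1}(1-\alpha_t) + \frac{\delta^2 \Delta m_i}{\sigma_s \sqrt{\eta_i}}\Big)
\le P(z \ge \xi),
\]

where $\xi = \sqrt{\frac{3a}{2}\log t} + \frac{\delta^2 \Delta m_i}{\sigma_s \sqrt{\eta_i}}$.
It follows that $\xi \ge 0$ if
$t \ge t_2^\dagger := e^{2\delta^4 \Delta m_i^2/(3a\sigma_s^2 \eta_i)}$.
It follows that for $t \ge t_2^\dagger$,

\begin{align*}
P_2(t) &\le \frac{1}{2}\exp\Big(-\frac{1}{2}\Big(\sqrt{\frac{3a}{2}\log t}
+ \frac{\delta^2 \Delta m_i}{\sigma_s \sqrt{\eta_i}}\Big)^2\Big) \\
&\le \frac{1}{2}\exp\Big(-\frac{1}{2}\Big(\frac{3ac_1}{2}\log t
- \frac{c_2 \delta^4 \Delta m_i^2}{\sigma_s^2 \eta_i}\Big)\Big) \\
&= \frac{1}{2}\, e^{\frac{c_2 \delta^4 \Delta m_i^2}{2\sigma_s^2 \eta_i}}\, t^{-\frac{3ac_1}{4}},
\end{align*}

where the second-last inequality follows from a lemma stated earlier.
Therefore,

\[
\sum_{t=1}^{T} P_2(t) \le t_2^\dagger
+ \sum_{t=1}^{T} \frac{1}{2}\, e^{\frac{c_2 \delta^4 \Delta m_i^2}{2\sigma_s^2 \eta_i}}\, t^{-\frac{3ac_1}{4}}
\le t_2^\dagger + \frac{3ac_1}{2(3ac_1 - 4)}\, e^{\frac{c_2 \delta^4 \Delta m_i^2}{2\sigma_s^2 \eta_i}}.
\]

The second statement follows similarly to the first statement in
Lemma 10. □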
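
Both lemmas rely on the elementary bound $\sum_{t \ge 1} t^{-p} \le 1 + \int_1^\infty t^{-p}\,dt = p/(p-1)$ for $p > 1$: with $p = a$ it gives the constant $a/(K(a-1))$, and with $p = 3ac_1/4$ it gives the coefficient $3ac_1/(2(3ac_1-4))$ after the factor $1/2$. The Python sketch below checks this numerically. It is only an illustration: the function names are hypothetical, and the values used for $a$, $K$, and $c_1$ in the usage note are assumptions, since $c_1$ is defined in Section 3 and not in this appendix.

```python
def power_sum(p, T):
    """Partial sum sum_{t=1}^{T} t^{-p}."""
    return sum(t ** (-p) for t in range(1, T + 1))


def power_sum_bound(p):
    """Upper bound 1 + int_1^inf t^{-p} dt = p / (p - 1), valid only for p > 1."""
    if p <= 1:
        raise ValueError("the bound requires p > 1")
    return p / (p - 1.0)


def alpha_sum_bound(a, K):
    """Bound a / (K (a - 1)) on sum_t alpha_t with alpha_t = 1 / (K t^a)."""
    return power_sum_bound(a) / K


def tail_coefficient(a, c1):
    """Coefficient 3 a c1 / (2 (3 a c1 - 4)), i.e. half of p / (p - 1) for p = 3 a c1 / 4."""
    return 0.5 * power_sum_bound(3.0 * a * c1 / 4.0)
```

For instance, with the assumed values `a = 2.0`, `K = 4.13`, and `c1 = 0.9` (so that $3ac_1/4 = 1.35 > 1$), the partial sums `power_sum(2.0, T) / K` and `0.5 * power_sum(1.35, T)` stay below `alpha_sum_bound(2.0, 4.13)` and `tail_coefficient(2.0, 0.9)` for every horizon `T`.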

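We now analyze the probability of event (A.3). Event (A.3) implies

\begin{align}
& m_{i^*} < m_i + \frac{2\sigma_s}{\sqrt{\delta^2 + n_i(t)}}\,\Phi^{-1}(1-\alpha_t) \notag\\
\iff{}& \Delta_i < \frac{2\sigma_s}{\sqrt{\delta^2 + n_i(t)}}\,\Phi^{-1}(1-\alpha_t) \notag\\
\implies{}& \frac{\Delta_i^2}{4\sigma_s^2}\,(\delta^2 + n_i(t)) < -2\log \alpha_t \tag{A.4}\\
\iff{}& \frac{\Delta_i^2}{4\sigma_s^2}\,(\delta^2 + n_i(t)) < 2\log K + 2a\log t \notag\\
\implies{}& \frac{\Delta_i^2}{4\sigma_s^2}\,(\delta^2 + n_i(t)) < 2\log K + 2a\log T, \tag{A.5}
\end{align}

where $\Delta_i = m_{i^*} - m_i$, the inequality (A.4) follows from
Lemma 2, and the inequality (A.5) follows from the monotonicity
of the logarithmic function. Therefore, the event (A.3) is not
true if

\[
n_i(t) \ge \frac{4\sigma_s^2}{\Delta_i^2}\,(2\log K + 2a\log T) - \delta^2.
\]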
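Setting $\eta_i = \max\big\{1, \big\lceil \frac{4\sigma_s^2}{\Delta_i^2}(2\log K + 2a\log T) - \delta^2 \big\rceil\big\}$, we get

\begin{align*}
E[n_i^T] &\le \eta_i + \sum_{t=1}^{T} P\big(Q_i^t > Q_{i^*}^t,\ n_i(t-1) \ge \eta_i\big) \\
&\le \eta_i + \sum_{t=1}^{T} \big(P_1(t) + P_2(t)\big),
\end{align*}

where the sums of $P_1(t)$ and $P_2(t)$ are bounded by Lemmas 10 and 11,
respectively.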

This completes the proof of the theorem. 
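
To make the choice of $\eta_i$ concrete, the following Python sketch evaluates the threshold and checks that, once arm $i$ has been sampled $\eta_i$ times, event (A.3) fails at every time $t \le T$. It is a sketch under stated assumptions: the function names are hypothetical, and the numerical values of $\Delta_i$, $\sigma_s$, $\delta^2$, $K$, $a$, and $T$ in the usage note are illustrative choices rather than values from the paper.

```python
import math
from statistics import NormalDist


def eta(delta_i, sigma_s, delta_sq, K, a, T):
    """eta_i = max{1, ceil(4 sigma_s^2 / Delta_i^2 * (2 log K + 2 a log T) - delta^2)}."""
    raw = (4.0 * sigma_s ** 2 / delta_i ** 2
           * (2.0 * math.log(K) + 2.0 * a * math.log(T)) - delta_sq)
    return max(1, math.ceil(raw))


def event_a3_holds(delta_i, n, t, sigma_s, delta_sq, K, a):
    """Event (A.3) written as Delta_i < 2 C_i(t), with
    C_i(t) = sigma_s / sqrt(delta^2 + n) * Phi^{-1}(1 - 1 / (K t^a))."""
    alpha_t = 1.0 / (K * t ** a)
    c = sigma_s / math.sqrt(delta_sq + n) * NormalDist().inv_cdf(1.0 - alpha_t)
    return delta_i < 2.0 * c
```

With the assumed values `delta_i = 0.5`, `sigma_s = 1.0`, `delta_sq = 1.0`, `K = math.sqrt(2 * math.pi * math.e)`, `a = 2.0`, and `T = 500`, calling `event_a3_holds` with `n = eta(...)` returns `False` for every `t` from 1 to `T`, in line with the argument leading to (A.5). The threshold grows logarithmically in `T`.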

Appendix B. Proof of Theorem 

As in the proof in Appendix A, at time $t$ the agent picks
option $i$ over $i^*$ only if $Q_{i^*}^t \le Q_i^t$. This is true when at least one
of the following equations holds:

\begin{align}
\mu_{i^*}(t) &\le m_{i^*} - C_{i^*}(t), \tag{B.1}\\
\mu_i(t) &\ge m_i + C_i(t), \tag{B.2}\\
m_{i^*} &< m_i + 2C_i(t), \tag{B.3}
\end{align}

where $C_i(t) = \sigma_i(t)\,\Phi^{-1}(1 - \alpha_t)$ and $\alpha_t = 1/(K t^a)$.

For $n_i(t) \ge 1$ and $n_{i^*}(t) \ge 1$, equations (B.1) and (B.2) reduce to


z > 


z > 


Mt) ^i:liPi(t) 

-;-d) *(1 “ at) + - -, and 

cr,-.(f) cr;.(f) 

riiO^ZLplit) 

-d) (1 - a,) 


cr,(f) ‘ '* cr,(f)’ 

respectively, where $e_i(t) = \sum_{j=1}^{N} \sum_{k=1}^{N} \sigma_{ik}(t)\, \lambda^0_{kj}\, (\mu_j^0 - m_j)$.
It follows that, for $n_{i^*}(t) \ge 1$,

Ie,,(f)| 2f=l cr,-.(f)cr^(f)|T^H^0 “ ^j\ 


O-iAt) 


-i;=l ^k=l ' 




< cr . 


< <Ts 


^fn^Jyvo-iAt) 

+4-con, ^ 


ZZI 4 IK 




l+6l , ^ ^ 

' I -cond 


^ 7=1 k=l 

For $n_{i^*}(t) = 0$, event (B.1) does not hold if




-I 1 3a 

O-;. (f)d) (1 - a,) > O-,-. y y log f 

2 N N 

2=1 lr=l 
> \eiAt)\. 


It follows, using the same argument as in Appendix A, that

1 


'*-cond 26 ^ 


^ P(event ( |B.1[ )) < max |e’'‘'^'’fi-cond’, e 3« 


3aci 


2(3aci - 4) 


Similarly, 

T 2fi 

^P(event ( |B.2| i, w,(f) > 1) < + 

f=i 

Also, event (B.3) is not true if


3aci f2^ 
-e - 


2(3aci - 4) 


4cr^ 

”‘'H) > ^(2 log A + 2a log T) - v. 

/y: 

I 

Adding the probabilities of the events (B.1)-(B.3), we obtain
the desired expression.